Prerequisites

Make sure you have Python installed, with the following packages:

  • pandas
  • networkx
  • matplotlib
  • hiveplot
  • IPython HTML Notebook 3.0

I recommend the Anaconda distribution of Python, for its ease of installation. You can download it here: https://store.continuum.io/cshop/anaconda/

You may wish to create a new environment for the tutorial. With Anaconda, I recommend running:

conda create -n net_tutorial scipy numpy pandas networkx matplotlib ipython-notebook

On Linux and Mac, switch to that environment with:

source activate net_tutorial

Windows users should do:

activate net_tutorial

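Ensure that the packages listed above are installed in this environment. Note that hiveplot is not part of the conda create command above, so it needs to be installed separately.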
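
One way to check this from inside the activated environment is to try importing each package. The snippet below is only a minimal sketch: the helper name check_packages is my own, and it assumes that each package's import name matches the name listed above (with IPython imported as IPython).

```python
import importlib


def check_packages(names):
    """Return a dict mapping each import name to True if it can be imported."""
    status = {}
    for name in names:
        try:
            importlib.import_module(name)
            status[name] = True
        except ImportError:
            status[name] = False
    return status


# Import names assumed to match the packages listed in the prerequisites.
required = ["pandas", "networkx", "matplotlib", "hiveplot", "IPython"]
for name, ok in check_packages(required).items():
    print(name, "OK" if ok else "MISSING")
```

Any package reported as MISSING still needs to be installed into the net_tutorial environment.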

Before we start...

You will need to pair up!

These are the pairing criteria:

  1. Given the following list comprehension: [s for s in my_fav_things if s['name'] == 'raindrops on roses']
    1. Provide a probable data structure for s and my_fav_things.
    2. Pair up with someone who has a different answer.
  2. If you have never run an IPython notebook before, then pair up with someone who has.

Following this, on the back of your PyCon 2015 mini-business cards:

  • On one card, write the names of other people in the class whose names you know.
  • On the other card, write a list of cities that you've been to in the past two years. The criterion is: I have stepped out of the port of entry and explored that city for at least 1 hour.

Network Basics

All your relational problems are belong to networks. :-)

Networks, a.k.a. graphs, are an immensely useful tool for modelling complex relational problems.

Networks are composed of two main entities:

  • Nodes: commonly represented as circles. In the academic literature, nodes are also known as "vertices".
  • Edges: commonly represented as lines between circles.

Edges denote relationships between the nodes.

In a network, if two nodes are joined together by an edge, then they are neighbors of one another.
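
As a minimal sketch of these ideas in NetworkX, the snippet below builds a tiny network and looks up a node's neighbors. The three names are hypothetical and serve only as an illustration.

```python
import networkx as nx

# Hypothetical toy network: three people, two friendships.
G = nx.Graph()
G.add_nodes_from(["Alice", "Bob", "Carol"])
G.add_edge("Alice", "Bob")
G.add_edge("Bob", "Carol")

# Bob shares an edge with both Alice and Carol, so both are his neighbors.
print(sorted(G.neighbors("Bob")))

# Alice and Carol are not joined by an edge, so they are not neighbors.
print(G.has_edge("Alice", "Carol"))
```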

There are generally two types of networks - directed and undirected. In undirected networks, edges do not have a directionality associated with them. In directed networks, they do.
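
The difference can be sketched with NetworkX's Graph (undirected) and DiGraph (directed) classes. The node labels "A" and "B" are placeholders chosen for illustration.

```python
import networkx as nx

# Undirected: a single edge covers both directions.
U = nx.Graph()
U.add_edge("A", "B")
print(U.has_edge("A", "B"), U.has_edge("B", "A"))

# Directed: the edge runs from the first node to the second only.
D = nx.DiGraph()
D.add_edge("A", "B")
print(D.has_edge("A", "B"), D.has_edge("B", "A"))
```

In the undirected graph both lookups succeed, while in the directed graph only the A-to-B lookup does.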

Examples of Networks

  1. Facebook's network: Individuals are nodes, edges are drawn between individuals who are FB friends with one another. This would likely be modeled as an undirected network.
  2. Air traffic network: Airports are nodes, flights between airports are the edges. This would likely be modeled as a directed network.
  3. Stock market correlation network: Individual stocks are nodes, stocks whose prices are highly correlated (beyond some defined correlation threshold) are joined by edges. Directed or undirected?
  4. Biologists commonly draw networks between interacting proteins. Proteins are nodes, experimentally-known interactions are the edges. Directed or undirected?

Can you think of any others?

The key questions here are: how do we...

  1. model a problem as a network?
  2. extract useful information from a network?

Take-Homes

On the practical side, I hope that you will leave this tutorial equipped to:

  • Use NetworkX to construct graphs in the IPython environment.
  • Model your data using nodes and edges.
  • Compute network statistics.
  • Visualize network data using node-link diagrams, heat maps, Circos plots and Hive plots.

From a broader perspective, I hope you will be able to:

  • Think in terms of "interactions" between entities, and not just think about the entities themselves.
  • Think through statistical problems in network analysis.

Credits

Much of this work is inspired by Prof. Allen Downey (Olin College of Engineering) and Prof. Jukka-Pekka Onnela (Harvard School of Public Health).

Hive plots and Circos plots were originally invented by Martin Krzywinski of the BC Genome Sciences Centre.

Circos plots were implemented with help from Justin Zabilansky (MIT).

Many thanks to the PyCon Rehearsal class for providing feedback on the material.

The Data

In this tutorial, we will go through two data sets.

The first one is a small-scale, synthetic social network among 30 individuals, which illustrates some of the basic concepts in constructing and analyzing networks. I will use this data set for the first half of the tutorial.

The second one is a larger-scale bicycle sharing data set, publicly available on the Divvy website, but also included with this tutorial. You will use this data set during the free hacking time.

We will also be constructing our own name-knowledge network in-class, as well as a city-people bipartite graph.
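
For a rough idea of what a city-people bipartite graph might look like in NetworkX, here is a minimal sketch. The people, cities and visits are made up, and storing the node set in an attribute named bipartite is a common convention rather than a requirement.

```python
import networkx as nx
from networkx.algorithms import bipartite

# Hypothetical data: which person has visited which city.
visits = [
    ("Alice", "Montreal"),
    ("Alice", "Boston"),
    ("Bob", "Boston"),
    ("Carol", "Vancouver"),
]

B = nx.Graph()
people = sorted({person for person, _ in visits})
cities = sorted({city for _, city in visits})

# Label each node with the partition it belongs to.
B.add_nodes_from(people, bipartite="people")
B.add_nodes_from(cities, bipartite="cities")

# Edges only ever join a person to a city, never two nodes of the same kind.
B.add_edges_from(visits)

print(bipartite.is_bipartite(B))
print(sorted(B.neighbors("Boston")))
```

The neighbors of a city node are the people who have visited it, and vice versa.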

